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1 SUPPLEMENTARY MATERIAL 

1.1 Algorithm: from alignments to compressed de 
Bruijn grpah 

In this section we present an algorithm that constructs a compressed 
de Bruijn graph from the set of self-alignments of length > k in 
the genome. We use mummer (Kurtz et al, 2004) to preprocess 
the genome and efficiently locate all self-alignments in the genome 
whose lengths are at least k. Our alignment-based algorithm begins 
with a graph consisting of one large node to represent the entire 
genome. Then, the algorithm considers one alignment at a time. As 
each alignment is incorporated into the graph, the nodes are split to 
represent smaller subsequences of the genome. Occasionally, nodes 
are merged when a repetition is detected in the genome. Thus our 
algorithm achieves a runtime that is related to the number of self- 
alignments bounded by k. The number of self-alignments shrinks 
rapidly as k grows. In contrast there is no such advantage to using 
large values of k in an uncompressed de Bruijn graph because the 
initial number of nodes is fixed by the genome size. 

Algorithm 2 depicts our alignment-based algorithm for constructing 
the compressed de Bruijn graph that represents a genome. We 
exclude implementation details that ensure the correctness of the 
algorithm. Each node captures a distinct subsequence of length > k 
in the genome. This is stored as a set of start positions and a length. 
We maintain separately a sorted set of all start positions in the 
graph, with pointers to the nodes that represent them, so that we 
can quickly navigate to a start position in the graph and easily query 
whether there is a node with a particular start position. Each distinct 
subsequence of length > k in the genome is represented by exactly 
one node in the compressed de Bruijn graph. Each fc-mer, which we 
denote by its start position in the genome, is included in exactly one 
node in the graph. This invariant is true in the final graph as well as 
during construction. In the final graph, there is one leaf, representing 
the end of the genome. However, during construction, we allow 
many nodes to be leaves, representing suffixes of the genome. At 
the end of construction, each leaf (except possibly the shortest one) 
has its sequence truncated and becomes a parent to the first node 
whose sequence begins within the leaf's sequence. 

We now summarize the procedure of our algorithm as it processes 
an alignment. We first insert the starting position of the first interval 
in the alignment, alignBegl, to the appropriate node and then 
we add the starting position of the second interval, alignBeg2, to 
the same node. If another node already represents alignBeg2, the 
nodes are merged. Before merging nodes that begin with identical 
subsequences, we ensure that they represent sequences of the same 
length and precede a merge by a split (splitBackwards) if one node's 
sequence is a proper prefix of the other's. When a starting position 
is added to a node, we ensure that the subsequence is removed 
from any other node that already captured it, splitting nodes as 
appropriate. Thus we ensure that there are no redundancies in the 
graph. 

When an alignment is considered, there are several scenarios that 
can occur when we insert alignBegl. 

1 . alignl.beg is a starting position of a node in the graph. 



existing node's. This uses the splitBackwards routine. Then 
a new start position of align2.beg is inserted to the new node 
[if it was not already there]. 

b. The existing node represents a shorter subsequence than 
the alignment. In this case, we insert the beginning of the 
alignment by inserting a new start position of align2.beg 
to the new node. Then the alignment is trimmed at its left 
end and we continue by iterating through the rest of the 
alignment. 

c. The existing node represents precisely the first interval 
of the alignment. In this case, align2.beg is added as a start 
position for the node. 

2. alignl.beg is not a starting position of any node in the 
graph. In other words, alignBegl is implicitly included within 
a node. 

a. The closest existing node with a start position before 
alignl.beg ends at alignl.end In this case, we use 
splitForwards to split the closest node with a start position 
less than alignl.beg into two nodes. Then align2.beg is 
inserted as a start position to the node that represents a suffix 
of the original node. 

b. The closest existing node with a start position before 
alignl.beg extends past alignl.end In this case, we use 
splitMiddle to split the closest node with a start position less 
than alignl.beg. This creates two new nodes. align2.beg is 
inserted as a start position to the new node that represents 
the middle of the original node. 

As the alignments are considered, nodes in the graph are 
merged and split. There are three ways in which a node is split, 
which we call splitBackwards, splitForwards, and splitMiddle. 
The splitBackwards routine is used when an alignment is a prefix 
to an existing node. It splits the existing node into two nodes. 
The splitForwards routine is used when an alignment is implicitly 
contained within an existing node, is not a prefix, and the alignment 
is a suffix of the existing node. It splits the existing node into 
two nodes. The splitMiddle routine is used when an alignment is 
implicitly contained within an existing node, is not a prefix, and 
the alignment ends earlier in the sequence than the existing node. It 
splits the existing node into three nodes. The splitting routines are 
depicted in Figure 1 . 

Self-overlapping alignments contributed additional complexity to 
our algorithm. Self-overlapping alignments are tandem repeats in 
the genome. We break down each self-overlapping alignment into 
its smallest repeating unit and create a node to capture the tandem 
repeat with all of its start positions. Then we create a separate node 
that bridges the occurrences of the tandem repeats, forming a cycle 
in the graph. We create an edge between these two nodes with 
multipfiicity to represent all recurrences of the tandem repeat. 



a. The existing node represents a longer subsequence than 
the alignment. In this case, we split the existing node to 
form a new node whose sequence is a proper prefix of the 



1 



Marcus et al 




Fig. 1. The three splitting routines in our alignment-based algorithm. splitBackwards splits a node representing a/3 into separate nodes for a and f3. splitMiddle 
splits a node representing a/37 mto separate nodes for a, j3 and 7. splitForwards splits a node representing a/3 into separate nodes for a and 0. Note that 
when a node representing the subsequence a/3 is split into separate nodes for a and /3, the overlapping k — 1 characters occur both at the end of the node for 
a and at the beginning of the node for (3. 
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Algorithm 2 Construct Compressed de Bruijn Graph from Alignments 
Input: genome sequence, k, set of self-alignments > k. 
Output: compressed forward de Bruijn graph of genome, 
for all lines in mummerOutputFile do 
if splitlnterval then 

> set align 1 and align2 to second part of self-ovlerap 

splitlnterval <— false 

else 

> load align 1 and align2 from input file 

if self overlapping alignment then 

[> split alignment to two parts 
> set align 1 and align2 to first part of self-overlap 

splitlnterval <— true 

else 

splitlnterval <i— false 
end if 
end if 

while ! intervallnserted do 

foundPos = findNodeBeginAtPos(alignl.beg) 
if foundPos ^ - 1 then 

foundNode <— nodes [foundPos] 

if foundNode.length > alignLength then 

D> foundNode is too long 

splitBackwards(foundNode, alignLength) 
intervallnserted <— true 
else if foundNode.length < alignLength then 

o foundNode is too short 

incToNextBegin <s— foundNode.length — k +1 
align l.beg+= incToNextBegin 
align2.beg+= incToNextBegin 
intervallnserted <— false 

else 

> first interval is represented by foundNode 

intervallnserted <s— true 
end if 

else 

> align 1. beg not found, implicitly included in a node 

lastNode <— closest node with start before align 1. beg 
if align Lend is end of lastNode then 

foundNode «— splitNodeForward(lastNode, align l.beg) 

else 

foundNode «— splitNodeMiddle(lastNode, align l.beg, align l.length) 
end if 

> foundNode represents align.beg 

createChild(newNode, align.beg) 
intervallnserted true 
end if 

addedStart «— addStartPosToNode(foundNode, align2.beg) 
if intervallnserted and addedStart then 
createChild(foundNode, align2.beg) 
end if 
end while 
end for 
updateLeavesQ 
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Algorithm 3 Construct Repeat Nodes from MEM nodes in suffix tree in 0(n log n) time and space 
1: procedure CREATERepeatNodesFromSuffixTree > recursive DFS of suffix tree 

2: CREATEREPEATNODESFROMMEM(root) 

3: end procedure 

4: procedure CREATEREPEATNoDEsFROMMEM(node) 

5: for all node children do 

6: CREATEREPEATNODESFROMMEM(node.child) 

7: end for 

8: if node.MEMthen 

9: if node.parent / root then > include path from root to MEM node 

10: extend node label left to include path label from root 

11: end if 

12: while node.strdepth > k do 

13: LMAnode «- node.LMA 

14: if LMAnode 7^ null then > skip LMAnode. strdepth characters 

15: if skippedChars then 

16: createRepeatNode for skipped segment of MEM 

17: end if 

18: numCharsToSkip «- LMAnode. strdepth — k + 1 

19: end if 

20: node <— node.suffixSkipsfO] 

21: if numCharsToSkip > 0 then > use suffix skips to traverse numCharsToSkip suffix links quickly 

22: numCharsToSkip 

23: if node.MEMthen 

24: break 

25: end if 

26: while numCharsToSkip > 0 do 

27: slinklndex <— Hoor(log(numCharsToSkip) / log(2)) 

28: slinkTraversing ^— pow(2, slinklndex) 

29: if node.closestLMA[slinkIndex] / null then 

30: if node.closestLMAproximity[slinkIndex] < numCharsToSkip then 

3 1 : adjust numCharsToSkip to extend over skipped LMA 

32: end if 

33: end if 

34: node node.suffixSkips[slinkIndex] 

35: numCharsToSkip — = slinkTraversing 

36: end while 

37: end if 

38: end while 

39: if needLastNode then 

40: createRepeatNode for overhang beyond last embedded MEM 

41: end if 

42: end if 

43: end procedure 
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Fig. 2. Example suffix tree and suffix skips for the string "babab$". For clarity, only a subset of the suffix links and skips are displayed. Leaf nodes with $ 
characters are also not shown. 
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Table 1. The 9 E. coli and 9 B. anthmcis strains included in our pan-genome 
analysis. 



Strain 


Size 


Accession 




B. 


anthracis A0248 uid33543 


5178 KB 


CP001598 


B. 


anthracis A16R uid40353 


5179 KB 


CP001974 


B. 


anthracis A16 uid40303 


5179 KB 


CP001970 


B. 


anthracis Ames 0581 uidl0784 


5178 KB 


AE017334 


B. 


anthracis Ames uid309 


5178 KB 


AE016879 


B. 


anthracis CDC 684 uid31329 


5181 KB 


CP001215 


B. 


anthracis CI uid36309 


5147 KB 


CP001746 


B. 


anthracis H9401 uid49361 


5170 KB 


CP002091 


B. 


anthracis str Sterne uidl()878 


5180 KB 


AE017225 




E. 


coli 0127 H6 E2348 69 uid32571 


4919 KB 


FM 180568 


E. 


coli 042 uid40647 


5193 KB 


FN554766 


E. 


coli 536 uidl6235 


4893 KB 


CP000247 


E. 


coli 55989 uid33413 


5107 KB 


CU928145 


E. 


coli ABU 83972 uid38725 


5083 KB 


CP001671 


E. 


coli APECOl uidl6718 


5034 KB 


CP000468 


E. 


coli APEC 078 uidl84588 


4753 KB 


CP004009 


E. 


coliBL21 DE3 uid20713 


4516 KB 


CP001509 


E. 


coli BL21 DE3 uid28965 


4516 KB 


AM946981 
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Fig. 3. The compressed de Bruijn graph for the B. anthracis pan genome with k=25 artistically rendered in Gephi using the ForceAtlas 2 placement algorithm. 
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Fig. 4. The compressed de Braijn graph for the B. anthracis pan genome with k=100 artistically rendered in Gephi using the ForceAtlas 2 placement algorithm. 
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Fig. 5. The compressed de Bruijn graph for the B. anthracis pan genome with k=1000 artistically rendered in Gephi using the ForceAtlas 2 placement 
algorithm. 
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Fig. 6. The compressed de Bruijn graph for the E. coli pan genome with k=25 artistically rendered in Gephi using the ForceAtlas 2 placement algorithm. 
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Fig. 7. The compressed de Bruijn graph for the E. coli pan genome with k=100 artistically rendered in Gephi using the ForceAtlas 2 placement algorithm. 
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Fig. 8. The compressed de Bruijn graph for the E. coli pan genome with k=l()0() artistically rendered in Gephi using the ForceAtlas 2 placement algorithm. 
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Node Length Histogram for B. anthracis (k=25) 
mean=1 22 +/- 947 max=1 30320 



Node Length Histogram for E. coli (k=25) 
mean =51 +f- 202 max =3 0711 
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Node length 




Node Length Histogram for B. anthracis (k=100) 
mean =3 82 +/- 4501 max =451 679 



Node Length Histogram for E. coli (k=100) 
mean =194 +/- 394 ma x -40 if 3 H 





Node Length Histogram for B. anthracis (k=1000) 
mean =33 23 +/- 22693 max=1 199923 



Node Length Histogram for E. coli (k=1000) 
mean =4 145 +/- 9732 max=346213 
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Fig. 9. Distributions of node lengths in the compressed de Bruijn graphs for the pan-genomes of 9 strains of E. coli and 9 strains of B. anthracis. 
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Distance to Core Genome for B. anthracis (k=25, core%=70) 
mean =3 +/- 88 max=1 000 



400 600 
Distance to care genome 



Distance to Core Genome for E. coli (k=25, core%=70) 
mean=167 +/- 366 max =1000 
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Distance to Core Genome for B. anthracis (k=100, core%=70) 
mean =2 +/- 35 max=1 000 



Distance to Core Genome for E. coli (k=100, core%=70) 
mean =369 +/- 435 max =1000 
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Distance to Core Genome for B. anthracis (k=1000, core%=70) 
mean =3 2 +/- 143 max =1000 



200 400 600 

Distance to care genome 



Distance to Core Genome for E. coli (k=1000, core%=70) 
mean =9 80 +/- 132 max =1000 
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Fig. 10. Distributions of distances to the core genome in the compressed de Bruijn graphs for the pan-genomes of 9 strains of E. coli and 9 strains of B. 
anthracis. 
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